[FLINK-32033][Kubernetes-Operator] Fix Lifecycle Status of FlinkDeployment Resource in case of MISSING/ERROR JM status #997
Conversation
```java
&& (error.toLowerCase()
                .contains(
                        "it is possible that the job has finished or terminally failed, or the configmaps have been deleted")
        || error.toLowerCase().contains("manual restore required")
        || error.toLowerCase().contains("ha metadata not available")
        || error.toLowerCase()
                .contains(
                        "ha data is not available to make stateful upgrades"))) {
```
Why are we checking this specific error?
In any case, we are the ones triggering this error, so please create a constant in `AbstractJobReconciler` and use that here.
@gyfora this seems to be the only case where we know that the cluster cannot recover on its own and needs a manual restore, hence I used this. I will set this as a constant instead for cleaner code.
- There are multiple instances where `HA metadata not available` is written in different forms, like `HA metadata not available` and `HA data is not available`. Should we maintain uniformity in these by changing these exception messages to use a constant (now that it is available)?
- Also, currently `flink-operator-api` does not have `flink-operator` as a dependency -> to use the constants in `AbstractJobReconciler` we would have to import it as a dependency, since the status change logic resides in `flink-operator-api`.

Should I still go ahead with this?
If possible, let's use a single constant, and we can keep that constant in the operator api module so the reconciler can use it.
@gyfora
I have added 3 constants for error messages that are used frequently and indicate a terminal failure, and referenced those in the reconcilers to maintain uniformity. I have also tried to keep the net changes to a minimum (although a few error messages will differ slightly). Do let me know if this looks good.
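For illustration, a minimal sketch of the constant-based approach described above, assuming a hypothetical holder class, constant names, and message wording (the PR's actual constants and their location may differ):

```java
// Illustrative sketch only: a hypothetical holder for the terminal error
// messages discussed above. Class name, constant names, and exact wording are
// assumptions, not the actual constants added in this PR.
public final class TerminalErrorMessages {

    public static final String JOB_FINISHED_OR_CONFIGMAPS_DELETED =
            "It is possible that the job has finished or terminally failed,"
                    + " or the configmaps have been deleted.";
    public static final String MANUAL_RESTORE_REQUIRED = "Manual restore required.";
    public static final String HA_METADATA_NOT_AVAILABLE = "HA metadata not available.";

    private TerminalErrorMessages() {}

    /** The check from the diff above can then reference the shared constants. */
    public static boolean isTerminalError(String error) {
        if (error == null) {
            return false;
        }
        String lower = error.toLowerCase();
        return lower.contains(JOB_FINISHED_OR_CONFIGMAPS_DELETED.toLowerCase())
                || lower.contains(MANUAL_RESTORE_REQUIRED.toLowerCase())
                || lower.contains(HA_METADATA_NOT_AVAILABLE.toLowerCase());
    }
}
```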
@gyfora I have corrected the tests that were failing in the above run due to the string changes. Validated locally; all tests passed. Can you please rerun the build?

@gyfora The branch was not up to date, so I had to update it; this will need to be built again. Also, kindly let me know if there are any further comments on the changes; I will get on top of them, if any.
What is the purpose of the change
Currently the lifecycle state shows STABLE even if an application deployment was deleted and stays in the MISSING / reconciling state. This change fixes the lifecycle status of the deployment so that it becomes FAILED when the JM is in MISSING/ERROR state with an error such as `configmaps have been deleted`, indicating a TERMINAL error.

Brief change log
- Terminal `FlinkDeployment` scenarios with `JobManagerStatus` error (`{"type":"org.apache.flink.kubernetes.operator.exception.UpgradeFailureException","message":"HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. ","additionalMetadata":{},"throwableList":[]}`) → Return FAILED lifecycle state
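To make the mapping concrete, here is a rough sketch of the intended derivation. The method structure and the `TerminalErrorMessages` helper are assumptions carried over from the sketch earlier in this conversation, not the PR's actual code; the `MISSING`, `ERROR`, and `FAILED` values follow the operator's status model.

```java
// Illustrative sketch only: derive the lifecycle state from the JM deployment
// status and the recorded error. Method and helper names are assumptions.
ResourceLifecycleState computeLifecycleState(FlinkDeploymentStatus status) {
    JobManagerDeploymentStatus jmStatus = status.getJobManagerDeploymentStatus();
    String error = status.getError();

    boolean jmMissingOrError =
            jmStatus == JobManagerDeploymentStatus.MISSING
                    || jmStatus == JobManagerDeploymentStatus.ERROR;

    // TerminalErrorMessages.isTerminalError is the hypothetical helper sketched above.
    if (jmMissingOrError && TerminalErrorMessages.isTerminalError(error)) {
        return ResourceLifecycleState.FAILED;
    }

    // Otherwise fall back to the existing derivation (STABLE, UPGRADING, ...).
    return ResourceLifecycleState.STABLE;
}
```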
Verifying this change
This change added tests and can be verified as follows:
- `ResourceLifecycleMetricsTest` to validate the Lifecycle Status for `flinkDeployment`
- Deleted the `flinkDeployment` JobManager Deployment -> caused MISSING JM Status -> FAILED Lifecycle Status of `flinkDeployment`
- JobManager Deployment Status: `DEPLOYING` -> hence the Lifecycle Status is `STABLE`
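A hedged sketch of what one such verification could look like; the builder and setter names below are assumptions, and the actual tests in `ResourceLifecycleMetricsTest` may be structured differently:

```java
@Test
void missingJmWithTerminalErrorReportsFailedLifecycleState() {
    // Hypothetical test helpers: build a FlinkDeployment and mutate its status
    // directly to simulate a deleted JobManager deployment.
    FlinkDeployment deployment = TestUtils.buildApplicationCluster();
    deployment.getStatus()
            .setJobManagerDeploymentStatus(JobManagerDeploymentStatus.MISSING);
    deployment.getStatus()
            .setError(
                    "HA metadata not available to restore from last state. "
                            + "It is possible that the job has finished or terminally failed, "
                            + "or the configmaps have been deleted.");

    // With the fix, a MISSING JM plus a terminal error should map to FAILED.
    assertEquals(
            ResourceLifecycleState.FAILED,
            deployment.getStatus().getLifecycleState());
}
```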
Does this pull request potentially affect one of the following parts:
- CustomResourceDescriptors: no

Documentation